1. Introduction



The aim of this series of papers [1-4] is to build a set of mathematical tools for studying the 
energy landscape of proteins [5,6,7], and the present paper is a step further towards this goal. 

The energy surface of proteins is the essential tool for understanding the physico-chemistry of
basic biological processes like catalysis [7]. It is also a complex multidimensional structure that
can only be built from the knowledge of the complete dynamical history of the molecule, which
is currently out of reach for conventional molecular-dynamics simulations (hereafter referred to as
MDS) [7]. One reason is that in an MDS trajectory the position of every atom in the molecule
is calculated with an accuracy of a hundredth of an angstrom, which quickly overwhelms even the
most powerful computers. The approach taken here consists in encoding the small movements of
a molecular system by means of a combinatorial structure that allows one to generate the set of
realizable combinations of these movements.

Within this approach, the 3-D-structures of protein molecules are encoded into binary objects
called dominance partition sequences (DPS) [1-4], which generalize a combinatorial
structure known as noncrossing partition sequences [8]. In this context the basic structure
for studying the molecular dynamics is the set of 3-D-conformations that have the same DPS;
these form a connected region in molecular conformational space^1 (in what follows abridged to
CS) called a cell, thus DPSs generate a partition of CS into disjoint cells. Partitions are a useful
tool for studying multi-dimensional spaces: in our case they systematically span a much wider
volume range than the set of points along a random trajectory curve generated by an MDS, and they
have also been used in many other contexts [5,6,9].

The aim of the preceding papers [1-4] was to construct a graph whose nodes are the cells visited
by the molecular system in its thermal wandering. Two important properties of partition sequences
make this construction possible:

1. DPSs are hierarchical structures: partition sequences encoding different sets of cells can 
be merged into a new partition sequence encoding the union set, and the process can be 
repeated with the new sets of cells, thus creating a hierarchy. The importance of this
property is that, in climbing the hierarchy ladder, the number of cells increases exponentially
while the sequence length increases only linearly. This compact coding makes it possible
to construct a graph representing huge regions of CS whose size does not exceed the
memory of a workstation computer, while keeping at the same time the essential information
about the molecular structures.

2. DPSs are modular structures: partition sequences can be decomposed into subsequences
that are embedded in different conformational subspaces. This allows one to define a
composition law: if two partition sequences from two different subspaces share the same sequence
for the intersection subspace, then joining both sequences gives a realizable sequence^2 [4].

The first property tells us that the graph can be constructed; the second suggests how to build
it: a molecular structure can be decomposed into sets of four atoms, its smallest 3D components,
and by composing the graphs of these one can build the graph of the molecule.

1 For an N-atom molecule it is a 3 x N-dimensional space where each point corresponds to a 3D molecular
conformation.

2 That corresponds to an existing set of cells.

Atoms in MDSs are represented as pointlike structures surrounded by a force field [10,11]; the
convex envelope of a set of 4 points in 3D-space is an irregular polytope called a 4-simplex or
simplex^3. The conformational space of these sets is relatively small, with 13824 cells, of which
only a fraction is visited by the system. With a CS so small it can be plausibly assumed that
the accessible cells are all visited during an MDS run.
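The figure of 13824 cells can be checked by direct enumeration. The sketch below is only an illustration, under the assumption that the count refers to cells of maximal dimension, i.e. one strict ordering of the four atoms on each of the x, y and z axes; the function name `count_top_cells` is ours and does not come from [1-4].

```python
from itertools import permutations, product

def count_top_cells(n_atoms: int) -> int:
    """Count the cells of maximal dimension for n_atoms points in 3D.

    On each axis a top-dimensional cell is a strict ordering of the
    n_atoms coordinates, i.e. a permutation of the atom indices; a cell
    of the full conformational space is one ordering for each of x, y, z.
    """
    orderings = list(permutations(range(1, n_atoms + 1)))
    return sum(1 for _ in product(orderings, repeat=3))

print(count_top_cells(4))  # (4!)**3 = 13824
```

The explicit enumeration is deliberately naive; for larger sets of atoms the closed form (n!)^3 should be used instead.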

The method for building the graph that was proposed in [2] consists in 

1. Establishing a morphological classification of simplexes, where each class is defined by a set 
of geometrical constraints. 

2. The geometrical constraints that define a class allow one to calculate the set of accessible cells
in a simplex CS [4]; thus to each class we can associate a graph whose nodes are the
cells from this set, with edges towards adjacent cells.

3. On the other hand computer simulations of protein dynamics show [2,4] that in a protein 
structure the majority of simplexes evolve within a reduced number of morphologies. For 
each 4-atom set in the molecule the graph of its CS is built by merging the graphs of the 
visited simplex morphologies. 

4. The CS graph of the molecule, which was called the graph of cells or G in [4], can be built
by composing the CS graphs of the different simplexes.

The graph of cells allows one to enumerate exactly the set of visited cells in conformational space, but
since the cells are encoded in a compact form, unwrapping them completely is probably
algorithmically hopeless. Instead, here we propose the construction of more manageable coarse-grained
encodings that, using the information from G, can be recursively decomposed into progressively
finer-grained ones. This subject is developed in the next five sections:

• Section 2 describes the basic mathematical framework from the standpoint of the graph of cells.

• Section 3 is about the basic mathematical properties of G. 

• Section 4 describes how to determine, from empirical data, a conical boundary for the region 
occupied by the system in CS. 

• Section 5 shows how to decompose this cone boundary into a set of smaller cones. 

• Section 6 is devoted to describing a combinatorial sequence that encodes the conical
boundary in its most compact form.

2. The basic construction 

It was shown [1] that the conformational space of a molecule of N+1 atoms^4 could be
described to a fair degree of accuracy by means of the partition generated by a set of hyperplanes
passing through the origin that form a Coxeter reflection arrangement^5 denominated $\mathcal{A}^N$ [8,12];
moreover the reflections form a symmetry group that is isomorphic to the symmetric group.

In our description of CS we have three independent arrangements, one for each coordinate (x, y, z),
i.e. $\mathcal{A}^{3\times N} = \mathcal{A}^N \times \mathcal{A}^N \times \mathcal{A}^N$, that generate three partitions of $\mathbb{R}^{3\times N}$, each dividing $\mathbb{R}^N$ into a
hierarchical set of regions shaped as polyhedral cones denominated cells. The hyperplanes in

3 In what follows this denomination will be used to designate ordered sets of 4 atoms/points.
4 N+1 is because the translation symmetry makes one dimension spurious [1,4].
5 So called because a reflection through one of the hyperplanes leaves the arrangement unchanged.



our partition are defined as

$$\mathcal{H}_{ij} : x_i - x_j = 0 , \qquad 1 \le i < j \le N+1 \qquad (1)$$

each $\mathcal{H}_{ij}$ divides $\mathbb{R}^N$ into three regions:

$$x_i < x_j , \qquad x_i = x_j \qquad \text{and} \qquad x_i > x_j \qquad (2)$$

in the first case we say that $x_j$ dominates $x_i$, in the second case neither $x_i$ nor $x_j$ dominates, in
the last case $x_i$ dominates $x_j$. As cells are bounded by the hyperplanes (1), a consequence of (2)
is that the points inside a given cell (in x, y or z) have the following property:

$$x_{i_1} \le x_{i_2} \le x_{i_3} \le \cdots \le x_{i_{N-1}} \le x_{i_N} \le x_{i_{N+1}} \qquad (3)$$

where the sequence $(i_1, i_2, i_3, \dots, i_{N-1}, i_N, i_{N+1})$ is a permutation of the set $\mathcal{I}_{N+1} = (1, 2, 3, \dots, N, N+1)$;
reflecting a point through $\mathcal{H}_{ij}$ is equivalent to permuting the coordinates i and j [8]. Thus a
cell where a strict "less than" relation holds for every pair of coordinates in (3) is encoded by the
dominance sequence

$$(i_1)(i_2)(i_3)\cdots(i_{N-1})(i_N)(i_{N+1}) \qquad (4a)$$

while a cell where $x_{i_\alpha} = x_{i_{\alpha+1}} = \dots = x_{i_{\alpha+r}}$ for r+1 consecutive indices $(i_\alpha, i_{\alpha+1}, \dots, i_{\alpha+r})$ in (3)
is encoded by the dominance sequence

$$(i_1)(i_2)(i_3)\cdots(i_\alpha\, i_{\alpha+1} \cdots i_{\alpha+r})\cdots(i_{N-1})(i_N)(i_{N+1}) \qquad (4b)$$

the first (4a) represents an N-dimensional cell while (4b) is an (N−r)-dimensional cell because it
corresponds to the intersection of the hyperplanes $\mathcal{H}_{ij}$ with $i, j \in (i_\alpha, i_{\alpha+1}, \dots, i_{\alpha+r})$.

Definition 1. The position of a coordinate $c_i$ in a cell of dimension N is the position of the
index i in the dominance sequence of c.
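As an illustration of the encoding (3)-(4) and of Definition 1, the following sketch builds the dominance sequence of one axis from a list of coordinates. It is a minimal example under our own assumptions: atoms are numbered from 1, ties are detected by exact equality, and for cells with tied coordinates the position is counted in groups; the names `dominance_sequence` and `position` are hypothetical.

```python
from itertools import groupby

def dominance_sequence(coords):
    """Return the dominance sequence of one coordinate axis.

    coords[k] is the coordinate of atom k+1.  Atoms are sorted by
    increasing coordinate; atoms with equal coordinates share a group.
    """
    order = sorted(range(len(coords)), key=lambda k: coords[k])
    return [tuple(k + 1 for k in grp)
            for _, grp in groupby(order, key=lambda k: coords[k])]

def position(seq, index):
    """1-based position of the group holding `index` in the sequence."""
    for pos, group in enumerate(seq, start=1):
        if index in group:
            return pos
    raise ValueError(f"index {index} not in sequence")

# Illustrative x coordinates of four atoms; atoms 1 and 3 are tied.
seq = dominance_sequence([0.3, -1.2, 0.3, 2.0])
print(seq)               # [(2,), (1, 3), (4,)]
print(position(seq, 4))  # 3
```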

An alternative encoding of cells is by means of an (N+1)×(N+1) antisymmetric sign matrix $S^c$, where
c stands for x, y or z. Let 1 ≤ i < j ≤ N+1, then for an arbitrary point x the matrix elements
of $S^c$ for the c coordinates are defined as:

$$S^c_{ij} = \begin{cases} - & \text{if } x_i < x_j \\ 0 & \text{if } x_i = x_j \\ + & \text{if } x_i > x_j \end{cases} \qquad (5)$$

As explained in [1,4], a direct consequence of (3) is that $S^c$ can be interpreted as the
incidence matrix of a digraph with no directed cycles, and the cell encodings (3) and (5) can be
readily converted into one another.
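The interconversion between the two encodings can be sketched as follows. This is an illustrative example, not the procedure of [1,4]: it assumes a sign matrix with entries −1, 0, +1 and recovers the dominance sequence by counting, for each row, how many coordinates lie strictly below it; the function names are ours.

```python
def sign_matrix(coords):
    """Antisymmetric sign matrix: S[i][j] = -1, 0, +1 as x_i <, =, > x_j."""
    n = len(coords)
    return [[(coords[i] > coords[j]) - (coords[i] < coords[j])
             for j in range(n)] for i in range(n)]

def sequence_from_signs(S):
    """Recover the dominance sequence from a sign matrix.

    The number of '+' entries in row i is the number of coordinates
    strictly below x_i, so rows with equal counts belong to tied atoms.
    """
    n = len(S)
    below = [sum(1 for j in range(n) if S[i][j] > 0) for i in range(n)]
    groups = {}
    for i, b in enumerate(below):
        groups.setdefault(b, []).append(i + 1)
    return [tuple(groups[b]) for b in sorted(groups)]

S = sign_matrix([0.3, -1.2, 0.3, 2.0])
print(sequence_from_signs(S))  # [(2,), (1, 3), (4,)]
```

The count-based reconstruction only makes sense for sign matrices that come from an actual point, that is, matrices whose digraph has no directed cycles.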

Lemma 1. Contiguous cells in space have different dimensionalities. 

Crossing to a contiguous cell implies going between two regions in (2), so one element $S^c_{ij}$ in
(5) changes its value, and this change can never be between + and − because this would mean
crossing $\mathcal{H}^c_{ij}$ while avoiding the region $c_i = c_j$.

Definition 2. A contiguous set is the set of all the n-dimensional cells contiguous to an
(n−1)-dimensional separator cell.

This allows one to build a hierarchical structure, the cell lattice poset, that results from ordering
contiguous cells by dimensionality [1,13].
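Definition 2 can be illustrated on dominance sequences. The sketch below rests on an assumption of ours: that the cells of the next higher dimension contiguous to a separator cell are obtained by splitting one of its tied groups into two ordered, non-empty groups. For a tied pair this gives the two orderings of the pair; the function name `contiguous_set` is hypothetical.

```python
from itertools import combinations

def contiguous_set(separator, k):
    """Cells of one dimension higher obtained by splitting group k.

    `separator` is a dominance sequence given as a list of tuples.  The
    group at (0-based) position k is split into two non-empty ordered
    groups in every possible way; every other group is left untouched.
    """
    group = separator[k]
    cells = []
    for size in range(1, len(group)):
        for lower in combinations(group, size):
            upper = tuple(i for i in group if i not in lower)
            cells.append(separator[:k] + [lower, upper] + separator[k + 1:])
    return cells

# Separator cell (2)(1 3)(4): atoms 1 and 3 have the same coordinate.
for cell in contiguous_set([(2,), (1, 3), (4,)], 1):
    print(cell)
```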

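Consider two arbitrary subpartitions $\mathcal{A}^{d_a}$ and $\mathcal{A}^{d_b}$ of $\mathcal{A}^N$, corresponding to the sets of indices
$\mathcal{I}_a = (i_{a_1}, i_{a_2}, \dots, i_{a_{d_a+1}}) \subset \mathcal{I}_{N+1}$ and $\mathcal{I}_b = (i_{b_1}, i_{b_2}, \dots, i_{b_{d_b+1}}) \subset \mathcal{I}_{N+1}$ respectively, and let
$\mathcal{I}_{a\cap b} = \mathcal{I}_a \cap \mathcal{I}_b$ be the set of indices that are common to both partitions.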

Definition 3. Two cells $\zeta_a \in \mathcal{A}^{d_a}$ and $\zeta_b \in \mathcal{A}^{d_b}$, with sign matrices $S^a$ and $S^b$ respectively, are
said to be compatible if $S^a_{ij} = S^b_{ij}$ $\forall\, i, j \in \mathcal{I}_{a\cap b}$.
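The compatibility test of Definition 3 amounts to comparing two sign matrices on their shared indices. The following sketch is a minimal illustration with made-up coordinates; the helper names and the representation of a cell as a pair (sign matrix, index list) are our assumptions.

```python
def sign_matrix(coords):
    """S[i][j] = -1, 0 or +1 as coords[i] <, = or > coords[j]."""
    return [[(a > b) - (a < b) for b in coords] for a in coords]

def compatible(S_a, idx_a, S_b, idx_b):
    """Definition 3: sign matrices agree on every pair of shared indices.

    idx_a / idx_b list the atom indices labelling the rows and columns
    of S_a / S_b, in order.
    """
    pos_a = {atom: k for k, atom in enumerate(idx_a)}
    pos_b = {atom: k for k, atom in enumerate(idx_b)}
    shared = [atom for atom in idx_a if atom in pos_b]
    return all(S_a[pos_a[i]][pos_a[j]] == S_b[pos_b[i]][pos_b[j]]
               for i in shared for j in shared)

# Hypothetical x coordinates of five atoms (illustrative values only).
x = {1: 0.0, 2: 1.5, 3: 0.7, 4: 2.2, 5: -0.4}
idx_a, idx_b = (1, 2, 3, 4), (2, 3, 4, 5)
S_a = sign_matrix([x[i] for i in idx_a])
S_b = sign_matrix([x[i] for i in idx_b])
print(compatible(S_a, idx_a, S_b, idx_b))    # True: both come from one point

# Exchanging the order of atoms 2 and 3 in the second cell breaks it.
y = {**x, 2: 0.7, 3: 1.5}
S_b2 = sign_matrix([y[i] for i in idx_b])
print(compatible(S_a, idx_a, S_b2, idx_b))   # False
```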

Lemma 2. The cell $\zeta_a \in \mathcal{A}^{d_a}$ is the projection of all the cells in $\mathcal{A}^N$ whose sign matrix S is such
that $S_{ij} = S^a_{ij}$ $\forall\, i, j \in \mathcal{I}_a$.

This is an immediate consequence of (3) and (5).

Let $\Xi_a$ and $\Xi_b$ be the sets of cells in $\mathcal{A}^N$ that are projected on $\zeta_a$ and $\zeta_b$ respectively.

Lemma 3. The set $\Xi_a \cap \Xi_b$ is non-empty iff $\zeta_a$ and $\zeta_b$ are compatible.

Suppose we have $\zeta \in \Xi_a$ but $\zeta \notin \Xi_b$; this means that the relative positions of the set of indices
$\mathcal{I}_{b\setminus a} = \mathcal{I}_b \setminus \mathcal{I}_a$ in the dominance sequence (4) are not the same as in $\zeta_b$. Since the reflection group
of the arrangement is the symmetric group, there will always be a set of permutations/reflections
that sorts the indices $\mathcal{I}_{b\setminus a}$ in the dominance sequence in the same order as in $\zeta_b$; this generates
a cell $\zeta' \in \Xi_a \cap \Xi_b$.

3. The graph of cells 

Lemmas 2 and 3 suggest that $\mathcal{A}^{3\times N}$ can be built by merging partitions of lower dimensionality.
The smallest 3D system is a set of 4 atoms, and $\mathcal{A}^{3\times(4-1)}$, the partition of its CS, has exactly
13824 cells, a computational complexity within the range of a desktop computer. Moreover, as
stated in the introduction, it can be reasonably assumed that such a small CS can be thoroughly
scanned by an MDS.

Following the procedure proposed in refs. [2,3,4] (outlined in the introduction) we can build the 
CS of a molecular system from the CS of the simplexes. For this, we need to construct the graph 
of cells or G which is defined as follows: 

Definition 4. Two simplexes are adjacent if they share a face. 

Definition 5. The nodes of G are the visited cells of each simplex with edges towards the 
compatible cells of adjacent simplexes. 

Definition 6. A transversal is a subgraph of G whose nodes are exactly one cell from every simplex,
such that every two cells from adjacent simplexes are compatible.

G embodies all the information contained in the CS of a molecular system since:

Theorem 1. The cells in a transversal are the projections of a single cell in CS.

By Lemma 3 the cells in the transversal are the projection of at least one cell in $\mathcal{A}^{3\times N}$. That cell
is unique: if there were two, their sign matrices would not be the same; say
that the element ij is different. There is a set of $\binom{N-1}{2}$ simplexes that harbor the indices i
and j, and within this set each simplex is adjacent to 2 × (N − 3) other simplexes; from Definition
5 the cells of adjacent simplexes have to be compatible, so the element ij in their sign matrices must be the
same for all, invalidating our assumption.
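The two counts used in this argument can be verified by brute force for a small molecule. The sketch below is only a check under stated assumptions: a molecule of N+1 = 8 atoms (an arbitrary choice), simplexes taken as unordered 4-atom sets, and adjacency meaning three shared atoms as in Definition 4.

```python
from itertools import combinations
from math import comb

def simplexes_with_pair(n_atoms, i, j):
    """All 4-atom sets (simplexes) that contain both atoms i and j."""
    return [s for s in combinations(range(1, n_atoms + 1), 4)
            if i in s and j in s]

def adjacent(s, t):
    """Definition 4: two distinct simplexes sharing a face (3 atoms)."""
    return len(set(s) & set(t)) == 3

n_atoms = 8                      # N + 1 atoms, so N = 7
N = n_atoms - 1
family = simplexes_with_pair(n_atoms, 1, 2)
assert len(family) == comb(N - 1, 2)
for s in family:
    neighbours = [t for t in family if adjacent(s, t)]
    assert len(neighbours) == 2 * (N - 3)
print(len(family), 2 * (N - 3))  # 15 8
```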

Corollary. In G a node that fails to form an edge with an adjacent simplex cannot exist since 
it is geometrically inconsistent. 



A useful structure derived from G is its compact form C, obtained by recursively substituting
every contiguous set of n-dimensional nodes by their (n−1)-dimensional separator cell.

Finally, a cell from $\mathcal{A}^{3\times N}$ is a class of an equivalence relation, since it contains all the 3-D-structures
that have the same dominance sequence. In what follows we use the terms cell and 3D-structure
interchangeably.

4. Determining a conical boundary for the molecular dynamics trajectory 

G is a huge structure and it is probably useless to try to explore it in full; rather, the approach we
take here is to focus on regions (subgraphs) from which we can expect to extract useful information.
We start with the problem of finding the bounds of interesting regions, with a concrete example
concerning a 2.1 ns MDS of pancreatic trypsin inhibitor (PTI) [14] that was fully described in [15].

As in [15] we restrict ourselves to studying the motion of the $C^\alpha_n$ carbons, each bearing a number n
that reflects the linear order of residues along the polypeptide chain; as our description of CS is
strictly modular, any conclusion that can be drawn on any subset of atoms is automatically valid
for the whole structure.

One piece of information easily extracted from an MDS is the set of dominance relations matrices $DR^c$, where
c stands for either x, y or z. Each element of these matrices defines the equation of a face
of a polyhedral cone that encloses the region that the molecular system occupies in CS. The
determination of the $DR^c$s from the MDS [15] takes the following steps:

• First, the simplex corresponding to the residue numbers $S_r = \{6, 36, 40, 47\}$ was selected as
the reference simplex because all along the MDS it stays within one morphological class,
and because it spans a wide volume across the molecule.

• Second, the coordinates of $S_r$ in the 1st MD frame were taken as a reference and the other
frames were rotated and translated so that the RMS deviation between $S_r(1)$ and $S_r(f)$ is a minimum
[16].

• Third, the quantities $DR^c_{ij}$, 1 ≤ i < j ≤ N+1, were determined:

- $DR^c_{ij} = +$, $DR^c_{ji} = -$ if $c_i > c_j$ for all coordinate frames.

- $DR^c_{ij} = -$, $DR^c_{ji} = +$ if $c_i < c_j$ for all coordinate frames.

- $DR^c_{ij} = DR^c_{ji} = 0$ if neither of the above relations holds. Also, by convention $DR^c_{ii} = 0$.
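The third step can be sketched in a few lines. The example below is a minimal illustration, not the code used in [15]: it assumes the frames have already been superposed on the reference simplex, treats one axis at a time, and uses −1, 0, +1 for the three matrix values; the function name `dominance_relations` and the toy frames are ours.

```python
def dominance_relations(frames):
    """Dominance relations for one axis of a trajectory.

    frames is a list of frames; each frame is a list with the coordinate
    of every atom along the chosen axis (after superposition).
    Returns DR with DR[i][j] = +1 if atom i is above atom j in every
    frame, -1 if it is below in every frame, and 0 otherwise.
    """
    n = len(frames[0])
    DR = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if all(f[i] > f[j] for f in frames):
                DR[i][j], DR[j][i] = 1, -1
            elif all(f[i] < f[j] for f in frames):
                DR[i][j], DR[j][i] = -1, 1
    return DR

# Two toy frames of three atoms: atoms 1 and 2 swap order, atom 3 stays on top.
frames = [[0.0, 1.0, 2.0],
          [0.5, 0.4, 2.5]]
for row in dominance_relations(frames):
    print(row)
```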

The meaning of the matrix elements is obvious: if $DR^c_{ij} = +/-$ the trajectory always stays on
the positive/negative side of $\mathcal{H}^c_{ij}$ (2); for $DR^c_{ij} = 0$ the trajectory can be on either side of $\mathcal{H}^c_{ij}$.
The matrices for x, y and z for the MDS [15] are shown in Figure 1; the number of non-zero
terms in the matrix is the dimension of the cone.

Lemma 4. The minimum position $\min^c_\mu$ of a coordinate $c_\mu$ is the number of matrix elements
$DR^c_{\mu j} = +$ plus 1 (1 ≤ j ≤ N, j ≠ μ), and the maximum position $\max^c_\mu$ is the minimum
position plus the number of matrix elements $DR^c_{\mu j} = 0$ (1 ≤ j ≤ N, j ≠ μ).

5. The fragmentation of the cone 

The dominance relations matrices $DR^c$ encode a lot of information about the structure of the
volume occupied by the system in CS. They give us the range of positions of a given coordinate
in the dominance sequence (3).

Figure 1. Antisymmetric dominance relations matrices for the $C^\alpha$ coordinates; only the upper triangle is shown. For the sake of
clarity, row and column amino acid numbers can be read from the annotated axes r and c. A matrix element can
have three values:

+ : $x_r > x_c$ for all coordinate frames in the molecular dynamics run.
− : $x_r < x_c$ for all coordinate frames in the molecular dynamics run.
0 : neither of the above relations holds.

The index μ in the dominance sequence must always stay to the right of the elements it dominates;
if there are $n_+$ such elements, the minimum position of μ is $n_+ + 1$. On the other hand, let $n_0$ be
the number of indifferent relations: μ can be either to the right or to the left of any of these, so
the maximum position of μ must be $n_+ + n_0 + 1$.
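The range of positions described above follows directly from one row of a dominance relations matrix. The sketch below is an illustrative implementation of that count, assuming matrix entries coded as −1, 0, +1 and 0-based row indices; the diagonal element is skipped so that it is not counted as an indifferent relation. The function name and the small matrix are ours.

```python
def position_range(DR, mu):
    """Lemma 4: (min, max) position of index mu from row mu of DR."""
    row = [DR[mu][j] for j in range(len(DR)) if j != mu]
    n_plus = sum(1 for v in row if v > 0)   # elements that mu dominates
    n_zero = sum(1 for v in row if v == 0)  # indifferent relations
    return n_plus + 1, n_plus + n_zero + 1

# Toy matrix for three atoms: the third dominates the other two,
# which are indifferent to each other.
DR = [[0, 0, -1],
      [0, 0, -1],
      [1, 1, 0]]
print([position_range(DR, mu) for mu in range(3)])  # [(1, 2), (1, 2), (3, 3)]
```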

A set of constraints can be defined for the cone 

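We can also extract from $DR^c$ sets of lower dimensional cells; these are useful for fragmenting G
into subgraphs of more manageable size. To do this we can proceed as follows: we select indices
μ and ν such that

$$\forall c \in \{x, y, z\} : \quad \mathrm{max}^c_\mu > \mathrm{min}^c_\nu , \quad \mathrm{max}^c_\nu > \mathrm{min}^c_\mu \quad \text{and}$$

$$\mathrm{MIN}(\mathrm{max}^c_\mu, \mathrm{max}^c_\nu) - \mathrm{MAX}(\mathrm{min}^c_\mu, \mathrm{min}^c_\nu) > h^c \qquad (6)$$

with $h^c = 1, 2, 1$ for $DR^c_{\mu\nu} = -1, 0, 1$ respectively.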

We thus select pairs of atomic indices μ and ν whose ranges overlap in x, y and z
simultaneously with intersection length > $(h^x, h^y, h^z)$ respectively. For every pair of indices their ranges in
any dimension are divided into three segments: left, middle (the intersection) and right; μ, for
instance, can occupy any position in the left and middle segments, while ν can be in the middle
and right ones. This makes a total of 3 possibilities, 4 if $DR^c_{\mu\nu} = 0$, in which case μ and ν can be
simultaneously in the middle segment. Obviously this can be extended to more than 2 indices: if
μν, μω and νω have overlapping ranges, for instance, then there is a common overlapping range
for μ, ν and ω too, which in turn gives segmentation and occupation patterns for μνω.
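Condition (6) can be written as a small predicate on the position ranges of two indices. The following is a sketch under assumptions that should be checked against the original typeset formula: the inequalities are taken as strict, as they appear in the text, and the ranges and matrix values are passed as plain dictionaries keyed by axis. All names and numbers in the example are hypothetical.

```python
def overlapping(ranges_mu, ranges_nu, dr_mu_nu):
    """Test condition (6) for a pair of indices on the three axes.

    ranges_mu, ranges_nu: dicts axis -> (min, max) position.
    dr_mu_nu: dict axis -> DR value (-1, 0 or +1) for the pair.
    """
    for c in "xyz":
        lo_m, hi_m = ranges_mu[c]
        lo_n, hi_n = ranges_nu[c]
        h = 2 if dr_mu_nu[c] == 0 else 1      # h^c = 1, 2, 1 for DR = -1, 0, 1
        if not (hi_m > lo_n and hi_n > lo_m):
            return False                       # ranges do not overlap on axis c
        if not (min(hi_m, hi_n) - max(lo_m, lo_n) > h):
            return False                       # overlap shorter than required
    return True

mu = {"x": (1, 6), "y": (2, 7), "z": (1, 5)}
nu = {"x": (3, 8), "y": (4, 9), "z": (2, 7)}
dr = {"x": 1, "y": 0, "z": -1}
print(overlapping(mu, nu, dr))                      # True
print(overlapping(mu, {**nu, "x": (5, 8)}, dr))     # False: x overlap too short
```

If the original inequalities turn out to be non-strict, only the two comparison operators need to change.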
Table I

The complete sets of indices for the α-carbons of the MDS described in [15] that conform to (6).



The importance of overlapping indices is twofold: 



1. A set of molecular conformational states can be determined from them using a minimum
number of cells from G: since the indices are the same for x, y and z, occupation
patterns for overlapping μ and ν, for instance, can be deduced from the cells in G
corresponding to the simplexes that bear these indices.

2. One can address the basic problem of how occupation states in the dominance sequence are 
correlated between different coordinates. 

The set of allowed overlapping indices that can be deduced from the $DR^c$ matrices in Figure 1
is given in Table I.

This allows a procedure for fragmenting the cone in Figure 1 into smaller ones. From G we can 
deduce for each set of indices from Table I a number of local conformations, the valid combinations 
of these conformations will give us smaller cones whose cells have DPSs with mean positions 
(x, y, z) much closer to the values of the cells in G. 



6. A combinatorial sequence for encoding cones. 

The codification of cones in conformational space could be much simplified by introducing a
simple extension in the formalism used for encoding dominance sequences: we allow expressions
enclosed between parentheses to overlap, and we distinguish between pairs of enclosing parentheses
by numbering them. Let us assume, for example, that we have a cone in CS with DR matrix

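$$\begin{array}{c|cccccc}
  & 1 & 3 & 4 & 7 & 8 & 9 \\ \hline
1 & 0 & + & + & 0 & 0 & 0 \\
3 & - & 0 & 0 & - & 0 & 0 \\
4 & - & 0 & 0 & - & 0 & 0 \\
7 & 0 & + & + & 0 & 0 & 0 \\
8 & 0 & 0 & 0 & 0 & 0 & 0 \\
9 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \qquad (7)$$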

The sequence

$$(_{1}\,3\ 4\ (_{2}\,8\ 9\,)_{1}\ 1\ 7\,)_{2} \qquad (8)$$

is meant to encode in one formula the sequences (3 4 8 9)(1 7) and (3 4)(1 7 8 9); these represent
the totality of cells from CS that lie inside the cone (7). Structures like (8) will be designated
as generalized compact dominance sequences (GCDS). Notice that parentheses enclosed
within parentheses are not allowed within GCDSs, since they are meaningless as dominance
sequences.
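The overlapping-parentheses notation can be unpacked mechanically. The sketch below is our own reading of the formalism, not a procedure stated in the text: since nested parentheses are not allowed, the k-th closing parenthesis is paired with the k-th opening one, and two indices are taken to be indifferent (matrix value 0) when they can share a group, and ordered otherwise. Both function names are hypothetical.

```python
def gcds_groups(text):
    """Split a GCDS written without parenthesis numbers into its groups.

    Assumption: the k-th closing parenthesis closes the k-th opening one
    (first opened, first closed), which is possible because groups may
    overlap but may not be nested.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    open_groups, groups = [], []
    for tok in tokens:
        if tok == "(":
            open_groups.append([])
        elif tok == ")":
            groups.append(tuple(open_groups.pop(0)))
        else:
            for g in open_groups:          # an index joins every open group
                g.append(int(tok))
    return groups

def dr_from_groups(groups, indices):
    """DR matrix implied by the groups (our interpretation): 0 if two
    indices share a group, else +1/-1 when the row index lies to the
    right/left of the column index."""
    where = {i: [k for k, g in enumerate(groups) if i in g] for i in indices}
    def rel(a, b):
        if a == b or set(where[a]) & set(where[b]):
            return 0
        return 1 if min(where[a]) > max(where[b]) else -1
    return [[rel(a, b) for b in indices] for a in indices]

groups = gcds_groups("(3 4 (8 9) 1 7)")
print(groups)  # [(3, 4, 8, 9), (8, 9, 1, 7)]
for row in dr_from_groups(groups, [1, 3, 4, 7, 8, 9]):
    print(row)
```

Under this reading, the rows printed for indices 3 and 4 are −, 0, 0, −, 0, 0, in agreement with the corresponding rows of (7).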

GCDSs can encode huge numbers of cells from CS; for instance, the first ten α-carbons in our
structure [14,15] evolve within a cone in $\mathcal{A}^{3\times N}$ encoded by the formula

$$\{\,\{(_{1}1\ (_{2}5\ 8\ (_{3}6)_{1}\ (_{4}2\ (_{5}9)_{2}\ 7)_{3}\ 10)_{4}\ 3\ 4)_{5}\}_x ,$$

$$\{(_{1}10)_{1}\ (_{2}9)_{2}\ (_{3}8)_{3}\ (_{4}7)_{4}\ (_{5}5\ (_{6}4)_{5}\ 6)_{6}\ (_{7}2\ (_{8}3)_{7}\ 1)_{8}\}_y ,$$

$$\{(_{1}8\ (_{2}7\ 10)_{1}\ 6\ 9)_{2}\ (_{3}3\ 4\ 5\ (_{4}1)_{3}\ 2)_{4}\}_z\,\} \qquad (9)$$

that can be easily checked by comparing it with the 10 x 10 upper-left submatrices in Fig. 1.



Not all the cones in CS can be represented by GCDSs. A simple example will show us that the
x-component of (9) cannot be extended beyond the 14th $C^\alpha$. Let (10)
be the $DR^x$ submatrix of the α-carbons 2, 3, 4, 10, 12 and 15; it gives the sequence

(2 (10) (3 4 15) 12) (11)

which is clearly inconsistent because 12 dominates 3 and 4 but is on the same dominance level
as 10. One can perform slight modifications in (10) that transform (11) into a valid compact
formula: setting to zero the circled components in (10) gives the $DR^x$ matrix of a GCDS-cone
that encloses the cone (10). This modification allows us to extend the generalized dominance
sequence x-component of (9) to the first 20 α-carbons

$$\{(_{1}19)_{1}\ (_{2}20\ (_{3}18)_{2}\ (_{4}1)_{3}\ 17\ (_{5}5\ (_{6}8\ (_{7}6)_{4}\ (_{8}2\ (_{9}9)_{5}\ 11\ 16)_{6}\ 7)_{7}\ (_{10}10)_{8}\ 3\ 4)_{9}\ 15\ (_{11}12)_{10}\ (_{12}14)_{11}\ 13)_{12}\}_x \qquad (12)$$

One can easily verify for every coordinate from Fig. 1 that the cone that bounds the evolution
in CS of every 10 consecutive residues of the molecular structure is a GCDS-cone; for chains
of about 20 residues or more this is generally no longer possible unless the values of some $DR^c$
elements are set to zero as in (10). This result would seem to suggest that in the MDS from [15]
thermodynamic equilibrium has not been attained: for instance, in (10) $C^\alpha_{10}$ can swap dominance
with $C^\alpha_3$, $C^\alpha_4$, $C^\alpha_7$ and $C^\alpha_{15}$, but pairs of cells with conformations where $C^\alpha_3$, $C^\alpha_4$ and $C^\alpha_7$ cross
one another on the x-axis have not been visited by the MDS. This example clearly shows that
GCDS-cones not only have a simple, elegant formula to describe them but also maximize
the number of available states (i.e. entropy); both properties make them very convenient tools
for studying CS.

By setting to zero a minimum number of $DR^c$ elements, 27 in x, 6 in y and 91 in z (1.9%, 0.4% and 6.6%
respectively), we obtain a GCDS-cone for the whole α-carbon chain



$$\{\,\{(_{1}49\ (_{2}48\ (_{3}29\ (_{4}27\ 28\ 30\ (_{5}31)_{1}\ 52)_{2}\ 47\ (_{6}32\ (_{7}53)_{3}\ 50)_{4}\ (_{8}26)_{5}\ 51)_{6}\ 21\ 23\ 24\ (_{9}19\ (_{10}20\ (_{11}25\ 33)_{7}\ 46\ 55\ (_{12}54)_{8}\ 22)_{9}\ \dots$$
This sequence sets the boundary for the molecular dynamics trajectory in [15] in a compact
form^6.



7. Conclusion 

This paper is an outline of a methodology for the exploration of CS. 

In [1-4] it was assumed that the small local movements of a molecule can be thoroughly sampled
in an MDS, and a procedure was devised for building the whole set of structures that result from
the combinations of these small movements. The result is a combinatorial structure called the
graph of cells, which gives a global view of the dynamical conformations of a molecular system.

Although the graph of cells fits in a desktop computer file, it encodes a huge number of
structures; the present paper is a first step towards solving the problem of managing this great quantity
of information. Three issues have been addressed:

1. we can give bounds that delimit interesting regions (cones) in CS, 

2. these cones can be decomposed into a set of smaller ones, 

3. it is shown that cones in CS can be described by a combinatorial sequence.

This last structure, the generalized compact dominance sequence, has embedded in it the whole
set of dominance sequences that are in a cone, and can be hierarchically decomposed into a poset
structure. On the other hand, the graph of cells can be seen as a set of constraints between the
x, y and z components of the allowed dominance sequences; the GCDSs and the graph
of cells thus complement each other beautifully, since the conformations of the molecular system can
be obtained by pruning the poset structure from the GCDS with the constraints from the graph
of cells. Moreover, GCDSs also have a graphical structure where paths and graphical distances
between cells (or 3D-structures) can be determined, and graphical distances between atoms in a
3-D-structure can be enumerated as well. That makes GCDSs well suited as the base structures
for the development of a combinatorial Hamiltonian in conformational space.

These issues will be further explored in forthcoming works of this series. 



6 α-carbons from end-residues 1, 2, 56, 57 and 58 are not included because they add disorder, unnecessarily
augmenting the volume of the cone without adding much information.